Open-source IRT: A comparison of BILOG-MG and ICL

Authors

  • Alan D. Mead
  • Scott B. Morris
  • David L. Blitz
Abstract

BILOG is the de facto standard for dichotomous IRT model estimation. However, BILOG is a commercial product and is limited in other ways. Hanson provides an open-source alternative, ICL, and this paper compares ICL to BILOG in terms of features and obtained item parameter estimates. In general, BILOG has more features, especially with respect to assessing model-data fit. One notable feature of ICL is built-in support for bootstrap estimates. ICL and BILOG produced very similar estimates of item parameters.

ICL BILOG Comparison

Open-source IRT: A comparison of BILOG-MG and ICL features and item parameter recovery

In their seminal article introducing BILOG and comparing it to LOGIST, Mislevy and Stocking (1989) noted the central role of estimation software for researchers interested in using item response theory (IRT). In the intervening 18 years, BILOG-MG has become the de facto standard for estimating the parameters of dichotomous IRT models (Rupp, 2003). This paper compares BILOG-MG (Zimowski, Muraki, Mislevy & Bock, 2003; hereafter designated BILOG) and ICL (Hanson, 2002a; 2002b), a relatively new program for estimating the parameters of IRT models. The proposed talk has three goals: (a) to introduce Hanson's (2002a) ICL software; (b) to describe common and unique features of ICL and BILOG so that users may choose the best program for a given application; and (c) to demonstrate that the two programs are equivalent in accomplishing their primary task of estimating item parameters.

Introducing ICL. Recently, Hanson (2002a) released stand-alone software for estimation of IRT model parameters, called IRT Command Language (ICL). ICL is free, open-source software (e.g., Raymond, 1999), similar to the popular R statistical software (de Leeuw & Mair, 2007), and is licensed in a way that allows it to be modified and extended.
In fact, ICL consists of IRT estimation functions (ETIRM; Hanson, 2000) embedded in a fully featured programming language called Tcl (“tickle”; Welch, Jones & Hobbs, 2003), which allows relatively complex operations. ICL is available for Windows, Macintosh, and Linux.

Comparing ICL and BILOG-MG

One reason for BILOG’s universal acceptance was its introduction of marginal maximum likelihood (MML) estimation in a Bayesian framework (Bock & Aitkin, 1981; Mislevy, 1986). The MML estimation algorithms introduced in BILOG were a significant statistical and practical advance over the programs available then, especially the popular LOGIST program (Mislevy & Stocking, 1986; Lord, 1980).

Other reasons for BILOG’s wide usage are very practical. BILOG has many features that are designed for applied work. For example, BILOG will read and score raw data in many different formats, and it allows flexibility in estimating from data collected under complex sampling schemes. BILOG has also enjoyed professional support, including a well-written manual and technical support from the publisher. The program has been maintained and its features have expanded. Today, BILOG comes with a Windows-based “shell” program that allows users to build command syntax from menus and pull-down lists. BILOG's ‘-MG’ suffix indicates the version that handles estimation in a multigroup situation (Bock & Zimowski, 1996); it is now the only generally available version.

A natural question is the degree to which ICL is similar to BILOG. This “apples and oranges” comparison defies simple lists of similar and separate features. Many of the features directly available in BILOG are available only indirectly in ICL, where they must be programmed. The ICL manual (Hanson, 2002b) is particularly helpful, providing a series of example ICL command files that accomplish various tasks.
We will confine our comparison of BILOG-MG and ICL to the features documented in their respective manuals (including examples) rather than assuming any special Tcl knowledge on the part of the ICL user.

Estimation. Both BILOG and ICL provide marginal maximum likelihood (MML; Bock & Aitkin, 1981) estimation via the EM algorithm (McLachlan & Krishnan, 1997; Dempster, Laird, & Rubin, 1977), and both implement a Bayesian framework (Mislevy, 1986). Many of the details of the estimation in ICL are provided in Woodruff and Hanson (1997) and Hanson (1998). One significant difference is that ICL does not implement Fisher scoring in the same way as BILOG; BILOG users will notice this in two ways. First, there are no “Newton cycles” following the “EM cycles” during estimation. Second, ICL does not compute the item variance-covariance matrix. As an alternative for users requiring the item variances or covariances, ICL implements a bootstrapping feature that can be used to generate data that can be analyzed with SAS or SPSS to estimate these values.

Estimation options. Both programs allow the user to influence parameter estimation, for example by setting priors on structural parameters and specifying the maximum number of estimation cycles. ICL provides a larger selection of options for expert users, although many of these options may not be useful for the average user. ICL also provides more flexibility regarding the prior distributions for item parameters (beta, normal, and log-normal are available).

Model fit information. Both programs provide information about convergence, but otherwise BILOG provides considerably more information about model fit. BILOG provides chi-square indices of the fit of individual items (or residual information for short tests), summaries of the estimates, and “fit plots” that overlay the empirical proportion correct for various “bins” of theta-hat values upon the ICC. This is a significant practical advantage for BILOG.

Documentation.
Both BILOG and ICL have well-written manuals. The manual describing BILOG (du Toit, 2003) also includes chapters providing an overview of the estimation procedures, other IRT programs from SSI, and historical material. The chapter on BILOG contains some introductory information, documentation of the commands and options, examples, and file formats. The ICL manual (Hanson, 2002b) is similar to the BILOG chapter of the SSI manual, providing introductory information, documentation of the commands, and examples. The documentation is divided between the basic commands required for a default single-group dichotomous model estimation and more advanced commands. Both programs come with PDF copies of the documentation, allowing easy searching for information. The ICL manual would profit from the addition of an index.

Data processing. Both programs provide data processing options, but BILOG offers more. For example, BILOG will score raw multiple-choice responses; ICL does not provide this capability directly. Also, missing responses are always ignored by ICL, while BILOG provides three models for missing data: wrong, partially right, and ignored.

Item parameter recovery study

Although one previous study of on-line calibration (Ban, Hanson, Wang, & Harris, 2001) found similar results using ICL and BILOG, no previous research has been reported that directly compares the abilities of the two programs to recover item parameters. To address the comparability and accuracy of the item parameter estimates produced by the two programs, we conducted an item-parameter recovery study using simulated data. We hypothesized that BILOG might embody implementation decisions (either statistical or logical) that result in better estimation for very small or very large samples. Therefore, we generated samples of three different sizes: N = 100, N = 1,000, and N = 50,000.
For each condition, item responses were generated for a 50-item test based on operational item parameters from a high-stakes certification exam. The true, generating parameters of the IRT model for each item are given in Table 1. The procedure was replicated five times. [We considered a larger number of replications; however, the results were remarkably stable.]

Method

Item responses were generated using a Fortran 90 program. The true abilities of the examinees were drawn from a standard normal distribution (M = 0, SD = 1) using the IMSL (1984) pseudo-random number generator DRNNOR. Next, the probability of a correct response to each item was computed for each examinee as a function of the individual’s ability according to a 3PL IRT model. Then, the individual’s response to each item was determined by generating a uniform random number between 0 and 1 using the IMSL DRNUN routine and assigning a score of 1 if this number was less than the individual’s probability for that item, and a score of 0 otherwise.

Results

The main outcome variable was the root mean square error (RMSE) between ICCs. We chose to compare differences in ICCs because, for some items, different values of item parameters can yield very similar ICCs. We computed the RMSE using 41 points from -3.0 to 3.0. Three comparisons were made:

  • ICCs computed from ICL estimates compared to ICCs computed from the true, generating parameters
  • ICCs computed from BILOG estimates compared to ICCs computed from the true, generating parameters
  • ICCs computed from ICL estimates compared to ICCs computed from BILOG estimates

This resulted in hundreds of comparisons. We summarized them by computing the mean and standard deviation of the RMSE statistic across the 50 items. As shown in Table 2, the comparisons of ICL estimates to truth (IT) and of BILOG estimates to truth (BT) yield RMSE values that are very similar in size and larger than those for the comparison of ICL estimates to BILOG estimates (IB).
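
The data-generation and RMSE steps described in the Method and Results sections can be sketched as follows. This is a minimal illustration in Python, not the authors' Fortran 90 program: the standard library's `random` module stands in for the IMSL routines, the scaling constant D = 1.7 and the equal spacing of the 41 theta points are assumptions (the source states neither), and the function names and example item parameters are hypothetical.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant; not stated in the source


def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(items, n_examinees, rng):
    """Generate a 0/1 response matrix for (a, b, c) items under the 3PL."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # true ability drawn from N(0, 1)
        # Score 1 when a uniform draw falls below the response probability.
        row = [1 if rng.random() < p_correct(theta, a, b, c) else 0
               for (a, b, c) in items]
        data.append(row)
    return data


def icc_rmse(params_1, params_2, n_points=41, lo=-3.0, hi=3.0):
    """RMSE between two ICCs over equally spaced theta points (assumed grid)."""
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        theta = lo + k * step
        diff = p_correct(theta, *params_1) - p_correct(theta, *params_2)
        total += diff * diff
    return math.sqrt(total / n_points)


# Hypothetical generating parameters (a, b, c); NOT the values in Table 1.
true_items = [(1.0, -0.5, 0.20), (0.8, 0.0, 0.25), (1.4, 1.0, 0.15)]
responses = simulate_responses(true_items, n_examinees=1000, rng=random.Random(1))

# Compare a hypothetical estimate with the first item's generating parameters.
print(round(icc_rmse((1.1, -0.45, 0.18), true_items[0]), 4))
```

Comparing curves rather than parameters reflects the rationale given in the Results: two parameter triples that differ noticeably can still produce nearly identical ICCs, and this statistic would treat them as close.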
The pattern in Table 2 seems to indicate that both programs are equally good at recovering item parameters and that they actually achieve comparable results (as opposed to disparate results that have a comparable accuracy). This is hardly surprising given the similarity of the core estimation algorithms. Table 2 does not indicate any interaction between sample size and recovery accuracy.

Discussion

At their heart, ICL and BILOG both estimate the parameters of dichotomous IRT models, and they perform this function using similar algorithms and with uniform accuracy. Slight differences in the second or third decimal of the RMSE statistic are unlikely to make any practical difference and can probably be ignored.

While both programs offer many similar features, BILOG is clearly the more mature product, with a number of practical advantages such as additional features, greater ease of use, and professional support. In any serious analysis of real data, BILOG has a decided advantage in terms of the rich information about model fit provided in the Phase 2 output. Indeed, most examination programs that rely upon BILOG today would find ICL’s default output to be inconveniently sparse.

ICL offers some users greater power. For example, ICL can be used to generate simulated item responses for research, and it offers built-in bootstrapping functionality for empirically estimating the sampling distribution of item parameter estimates. In some research using simulated data, model fit may not be an issue and ICL may be more convenient than BILOG. For those who grasp Tcl, ICL offers a unique opportunity to extend the program with additional functionality. In the form of ETIRM, ICL also offers a unique opportunity for programmers to build professional-grade IRT parameter estimation into item banking software, simulation studies, and other applications.
Finally, because ICL incorporates both dichotomous and polytomous models, ICL may be a simpler solution to the estimation of mixed-format examinations.

Limitations. We are currently evaluating the theta-hat estimates produced by the two programs; our talk will report those results. We also plan comparisons for multigroup situations. In addition, this study used data generated from the 3PL with a normally distributed theta density; that is, the data were perfectly coincident with the assumptions of the IRT model and the estimation software. We are currently exploring the effect, if any, of using non-normal theta densities.
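
The bootstrap route to item variances and covariances mentioned in the Estimation paragraph and in the Discussion can be illustrated with a short sketch. It does not use ICL's own bootstrap commands, which are not shown in this paper. The function names are hypothetical, the data are made up, and the per-item proportion correct is only a stand-in for a real IRT calibration, which would be rerun on each resample.

```python
import random


def bootstrap_samples(data, n_boot, rng):
    """Yield n_boot resamples of examinee rows, drawn with replacement."""
    n = len(data)
    for _ in range(n_boot):
        yield [data[rng.randrange(n)] for _ in range(n)]


def covariance_matrix(replicates):
    """Sample covariance matrix of a list of equal-length estimate vectors."""
    n = len(replicates)
    k = len(replicates[0])
    means = [sum(r[j] for r in replicates) / n for j in range(k)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in replicates) / (n - 1)
             for j in range(k)]
            for i in range(k)]


def proportion_correct(data):
    """Stand-in estimator: per-item proportion correct, NOT an IRT calibration."""
    n = len(data)
    return [sum(row[j] for row in data) / n for j in range(len(data[0]))]


rng = random.Random(7)
# Hypothetical 0/1 response data: 200 examinees by 3 items.
data = [[1 if rng.random() < p else 0 for p in (0.8, 0.6, 0.4)] for _ in range(200)]

# One vector of estimates per bootstrap resample.
replicates = [proportion_correct(s) for s in bootstrap_samples(data, 100, rng)]
cov = covariance_matrix(replicates)

# Square roots of the diagonal are bootstrap standard errors.
print([round(cov[j][j] ** 0.5, 3) for j in range(3)])
```

In the workflow the paper describes, the replicate estimates would come from the estimation program and the covariance step would be carried out in a package such as SAS or SPSS; the `covariance_matrix` function here plays that second role.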


Similar resources

Item Response Modeling With BILOG-MG and MULTILOG for Windows

Item response theory (IRT) has become one of the most popular scoring frameworks for measurement data. IRT models are used frequently in computerized adaptive testing, cognitively diagnostic assessment, and test equating. This article reviews two of the most popular software packages for IRT model estimation, BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and MULTILOG (Thissen, 1991), which...

Effects of Initial Values and Convergence Criterion in the Two-Parameter Logistic Model When Estimating the Latent Distribution in BILOG-MG 3

Parameters of the two-parameter logistic model are generally estimated via the expectation-maximization algorithm, which improves initial values for all parameters iteratively until convergence is reached. Effects of initial values are rarely discussed in item response theory (IRT), but initial values were recently found to affect item parameters when estimating the latent distribution with ful...

Separate Versus Concurrent Estimation of IRT Item Parameters in the Common Item Equating Design

DOCUMENT RESUME TM 030 621 Hanson, Bradley A.; Beguin, Anton A. Separate versus Concurrent Estimation of IRT Item Parameters in the Common Item Equating Design. American Coll. Testing Program, Iowa City, IA. ACT-RR-99-8 1999-12-00 36p. ACT Research Report Series, PO Box 168, Iowa City, IA 52243-0168. Reports Evaluative (142) MF01/PCO2 Plus Postage. *Equated Scores; Estimation (Mathematics); *It...

Issues Affecting Item Response Theory Fit in Language Assessment: A Study of Differential Item Functioning in the Iranian National University Entrance Exam

This study aimed at examining the issues affecting the use of IRT models in investigating differential item functioning in high stakes testing. It specifically focused on the Iranian National University Entrance Exam (INUEE) Special English Subtest. A sample of 200,000 participants was randomly selected from the candidates taking part in the INUEE 2003 and 2004 respectively. The data collected ...

208-2012: How Test Length and Sample Size Have an Impact on the Standard Errors for IRT True Score Equating: Integrating SAS® and Other Software

The standard error of equating is a useful index to quantify the amount of equating error. It is the standard deviation of equated scores over replications of an equating procedure in samples from a population or populations of examines. The current study estimates the SE of item response theory true score equating in the Nonequivalent Groups with Anchor Test design using simulations. Specifica...

Publication date: 2008